Computer analysis of complete prokaryotic genomes shows
that microbial proteins are in general highly conserved:
~70% of them contain ancient conserved regions. This allows
us to delineate families of orthologs across a wide 
phylogenetic range and, in many cases, predict protein 
functions with considerable precision. Sequence database 
searches using newly developed, sensitive algorithms result in 
the unification of such orthologous families into larger 
superfamilies sharing common sequence motifs. For many of 
these superfamilies, prediction of the structural fold and 
specific amino acid residues involved in enzymatic catalysis is 
possible. Taken together, sequence and structure comparisons 
provide a powerful methodology that can successfully 
complement traditional experimental approaches. 
Abbreviations

COGs clusters of orthologous groups 
HAD haloacid dehalogenase 

Introduction 

The determination of the complete genome sequences 
of several bacteria and archea and one eukaryote
[1-6,7**-12**] marked the beginning of a new age in biology.
For the first time, we can take a look at the complete
set of proteins present in the cells of each
particular organism and try to identify the proteins 
responsible for each cellular function. In cases where no 
known proteins can be found to perform a particular 
task, the most likely substitutes can be predicted from 
the set of unassigned gene products. Clearly this can be 
done only by analysis of complete genomes, as partial 
sequences do not allow us to ascertain that certain proteins
are not encoded in a given genome [13]. These
new approaches are gradually changing our understand- 
ing of a variety of biological phenomena. As the number 
of sequenced genomes is expected to grow exponential- 
ly for the next few years, their impact on different bio- 
logical disciplines will increase. We have recently 
discussed the implications of the complete genomes for 
microbial evolution [14]. Here we consider the effect of
the genome revolution, together with the improving 
methods for sequence analysis, on our ability to predict 
and understand protein structure and function. 



Towards a natural taxonomy of proteins and 
protein families

The numerous genome sequencing projects have resulted 
in a rapid growth of protein databases (see, e.g. [15]). In 
contrast to the pre-genome era, when researchers typically 
chose to clone and sequence genes with documented 
functional roles, we are now getting many protein 
sequences whose functions are not known. This presents 
a challenge to extract the most from these sequences in 
terms of salient features of the encoded proteins, for exam- 
ple to classify them according to their homologous rela- 
tionships, and to predict their possible catalytic activities 
and/or cellular functions, three-dimensional (3D) struc- 
tures and evolutionary origin. 

Protein classifications, pioneered by Dayhoff and her co- 
workers, have historically been based on sequence align- 
ments. Similar proteins formed families, which were 
combined into superfamilies [16]. This approach, continued
in the PIR database [17], proved extremely popular.
However, even PIR superfamilies often unite only closely
related proteins, so more distant relationships are
missed. Other protein databases, such as PROSITE [18],
PRINTS [19], Pfam [20], and ProDom [21], group proteins
on the basis of conserved sequence motifs and,
generally, contain much more diverse protein families.
Structural comparisons of proteins, implemented in FSSP, 
CATH and SCOP databases, offer yet another approach 
to protein classification [22-24]. SCOP superfamilies, for 
example, unite proteins that have some similarities in 
their 3D structures, but often no detectable sequence 
similarity [25]. Thus, in the absence of clear sequence or 
structural similarities, the criteria for inclusion of distantly
related proteins into a family (or superfamily) become
increasingly arbitrary. 

With the inception of extensive genome sequencing, it has 
become possible to classify genes and proteins on a different
principle, namely by delineating families of paralogs, that is,
related genes within the same genome [26,27]. Such
analyses have revealed a complex hierarchical organization 
of paralogous families in each of the studied genomes and 
produced at least two generalizations: first, the fraction of 
genes that belong to families of paralogs increases with the
total number of genes in a genome, from
~25% in the minimal genome of Mycoplasma genitalium to
>50% in the large (for a prokaryote) Escherichia coli genome;
second, the largest superfamilies of paralogs are mostly the
same in all genomes [28-33]. 
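
The fraction of genes that belong to paralogous families can be illustrated with a small sketch. The text does not describe how the cited analyses clustered genes, so single-linkage clustering is an assumption here, and the gene names, the similar pairs and the helper names `paralog_families` and `paralog_fraction` are all invented for illustration.

```python
def paralog_families(genes, similar_pairs):
    """Group the genes of one genome into families by single-linkage clustering.

    `similar_pairs` lists gene pairs judged to be significantly similar
    (how that judgement is made is outside this sketch).
    """
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the family representative (with path halving).
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    families = {}
    for g in genes:
        families.setdefault(find(g), []).append(g)
    # Only clusters with at least two members count as paralogous families.
    return [sorted(f) for f in families.values() if len(f) > 1]


def paralog_fraction(genes, similar_pairs):
    """Fraction of genes that belong to some family of paralogs."""
    families = paralog_families(genes, similar_pairs)
    return sum(len(f) for f in families) / len(genes)


# Hypothetical toy genome with eight genes and three similar pairs.
GENES = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
PAIRS = [("g1", "g2"), ("g2", "g3"), ("g5", "g6")]

print(paralog_families(GENES, PAIRS))
print(paralog_fraction(GENES, PAIRS))
```

In this toy genome five of the eight genes fall into two families. Single linkage tends to chain distantly related genes together, which is one reason the hierarchical organization mentioned above matters in practice.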

Knowledge of all the protein sequences from multiple complete
genomes (Table 1) allows us to redefine the entire
problem of protein classification. Since the fraction of proteins
conserved over large phylogenetic distances (ancient
conserved domains) appears to be nearly constant at ~70%
in all prokaryotic genomes [34*], it becomes feasible to
replace more or less arbitrary clustering of proteins by similarity
with consistent groups in which the evolutionary relationships
between the members are specifically defined.
Such a classification of proteins can provide a framework for
evolutionary studies and for rapid, largely automatic, functional
annotation of newly sequenced genomes.

Table 1

Protein families and 3D structures in complete genomes.

                                       Proteins encoded in the genome*                   3D structures
Species                                Total     Belong to COGs†     COGs found        In PDB   Predicted‡
                                                 number (% total)    (% total)

Escherichia coli                       4289      2003 (47%)          821 (95%)         240      667
Haemophilus influenzae                 1717      979 (57%)           [illegible]       2        267
Helicobacter pylori                    1566      841 (54%)           617 (72%)         0        169
Synechocystis sp.                      3169      1551 (49%)          796 (93%)         2        431
Borrelia burgdorferi                   850       483 (57%)           363 (42%)         0        105
Bacillus subtilis                      4100      1945 (47%)          732 (85%)         12       578
Mycoplasma genitalium                  467       341 (75%)           290 (34%)         0        75/103
Mycoplasma pneumoniae                  677       378 (56%)           309 (36%)         0        78
Methanococcus jannaschii               1715      830 (48%)           498 (58%)         0        170
Methanobacterium thermoautotrophicum   1869      897 (48%)           484 (56%)         0        199
Archaeoglobus fulgidus                 2407      1131 (47%)          512 (60%)         0        290
Saccharomyces cerevisiae               5932      1736 (29%)          577 (67%)         45       846
Caenorhabditis elegans                 12,178    2172 (18%)          466 (54%)         2        NA

*The numbers are from the latest updates in the GenBank genome division (ftp://ncbi.nlm.nih.gov/genbank/genomes). The C. elegans genome is about
85% complete; the data are from Wormpep12 (www.sanger.ac.uk/Projects/C_elegans/wormpep). †Based on the set of 860 COGs, obtained by
adding H. pylori proteins to the original set of 720 COGs [37**]. ‡The numbers are from the PEDANT database [53*], calculated by comparing the
protein set encoded in each genome to the PDB using FASTA with a cutoff score of 120; the second figure for M. genitalium is from [54*]; the data
for C. elegans are not available.

Several classifications of homologous proteins encoded in 
complete genomes have been produced, based on all-against-all
protein sequence comparisons [35,36,37**]. Each
of these projects is aimed at the identification of orthologs, 
that is direct counterparts in different genomes, connected 
by an uninterrupted line of vertical descent and typically 
retaining their physiological function [26,27]. In particular,
the system of clusters of orthologous groups (COGs) was 
designed to accommodate the vastly different evolution 
rates observed for different genes [37**]. The COGs construction
procedure identifies the closest homologs in each
of the sequenced genomes for each protein, even if the sim- 
ilarity is fairly low and not statistically significant by itself. 
The approach to the identification of COGs was built upon 
the transitivity of orthologous relationships, that is the sim- 
ple notion that any group of at least three genes from dis- 
tant genomes, which are more similar to each other than 
they are to any other genes from the same genomes, is most 
likely to belong to an orthologous family. Clearly, this is a 
probabilistic assumption based on a 'weak molecular clock 
concept', which posits that orthologs are more similar to
each other than they are to paralogs with different, even if



related, functions. This assumption, however, seems to 
hold true in cases where we have reasons to accept orthology
on functional grounds (for example, aminoacyl-tRNA
synthetases or ribosomal proteins). Orthology is not neces- 
sarily a one-to-one relationship, as in cases of lineage-spe- 
cific duplications, orthology can only be established 
between families of paralogous genes. Such complex rela- 
tionships require caution in the functional interpretation of 
the phylogenetic classification of proteins. Nevertheless, 
about 60% of the original set of 720 COGs [37**] are simple
families, with no paralogs or with paralogs from one lineage
only, suggesting the possibility of straightforward transfer of
functional information from functionally characterized
genes from model systems such as E. coli and yeast to those
from poorly characterized genomes.
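
The triangle criterion described above (at least three genes from distant genomes that are each other's closest homologs) can be illustrated with a toy sketch. The similarity scores, genome codes, gene names and the helper names `best_hits` and `find_triangles` are invented for illustration. The sketch covers only the detection of mutually consistent best hits; it does not attempt the handling of lineage-specific paralogs or any of the other steps of the published COG construction procedure.

```python
from itertools import combinations

# Hypothetical pairwise similarity scores; a gene is written (genome, name).
# Scores are symmetric and invented purely for illustration. Note that two
# sides of the a1/a2/a3 triangle have fairly low scores.
SCORES = {
    (("Ec", "a1"), ("Hi", "a2")): 310,
    (("Ec", "a1"), ("Mj", "a3")): 95,
    (("Hi", "a2"), ("Mj", "a3")): 90,
    (("Ec", "b1"), ("Hi", "b2")): 400,
    (("Ec", "a1"), ("Hi", "b2")): 40,
    (("Ec", "b1"), ("Mj", "a3")): 35,
}


def score(g1, g2):
    """Look up a symmetric similarity score; unlisted pairs score 0."""
    return SCORES.get((g1, g2), SCORES.get((g2, g1), 0))


def best_hits(genes):
    """Map (gene, other genome) to the best-scoring gene of that other genome."""
    hits = {}
    genomes = {g[0] for g in genes}
    for query in genes:
        for genome in genomes - {query[0]}:
            candidates = [g for g in genes if g[0] == genome]
            top = max(candidates, key=lambda g: score(query, g))
            if score(query, top) > 0:
                hits[(query, genome)] = top
    return hits


def find_triangles(genes):
    """Return gene triples from three genomes in which every gene is the
    best hit of every other gene in that gene's genome."""
    hits = best_hits(genes)
    triangles = []
    for trio in combinations(sorted(genes), 3):
        if len({g[0] for g in trio}) < 3:
            continue  # the three genes must come from three different genomes
        consistent = all(
            hits.get((a, b[0])) == b
            for a in trio for b in trio if a != b
        )
        if consistent:
            triangles.append(trio)
    return triangles


GENES = sorted({g for pair in SCORES for g in pair})
TRIANGLES = find_triangles(GENES)
print(TRIANGLES)
```

Only the a1/a2/a3 triple passes: b1 and b2 are each other's best hits, but no third genome contributes a gene whose best hits point back to them, so they do not form a group under the three-lineage requirement.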

The utility of this system of protein classification was tested
on several newly sequenced bacterial, archeal and
eukaryotic genomes. Interestingly, with the only exception
of the minimal genome of M. genitalium, the fraction of the
proteins that belong to the COGs (ancient families conserved
across a wide phylogenetic range) is about the
same and very close to 50% for all prokaryotic genomes
(Table 1). This is clearly compatible with the previous estimate
that about 70% of the proteins encoded in each
genome contain ancient conserved regions. The fraction of
the proteins included in the COGs is at this time lower,
which is evidently due to the requirement for three distant
lineages to be included, and to the limited number of
species in the first instalment of the COGs. There is little
doubt that with new genomes added, the number of COGs
will asymptotically approach the total number of ancient
conserved regions. By contrast, this fraction is much lower
for eukaryotic genomes, indicating the prevalence of
eukaryote-specific families.

Comparison of the new protein sets with the COGs resulted
in a number of functional predictions for previously
uncharacterized proteins. Even for the Helicobacter pylori
proteins, most of which show highly significant similarity to
homologs from E. coli and other bacteria and have been
described in considerable detail [8**], predictions were made
in more than 100 cases (http://www.ncbi.nlm.nih.gov/COG);
function was also predicted for a number of archeal and
worm proteins (EV Koonin, RL Tatusov, MY Galperin,
unpublished data).

Missing gene families and evolution of 
metabolic pathways 

Comparative analysis of the available complete genomes 
shows that metabolic diversity generally correlates with 
genome size. Parasitic bacteria import a variety of metabo- 
lites, which allows them to shed genes encoding enzymes 
for many or even most of the metabolic pathways [1-3,
8**,33,38]. In contrast, all cells have to rely on their own
gene products for performing such essential functions as
genome expression, replication and repair, membrane
biogenesis and others. These tasks alone require at least
about 200 genes [1,37**].

Given complete genome sequences, classification of pro- 
teins into orthologous groups provides a convenient way to 
systematically survey the protein families present or 
absent in a genome and to identify the metabolic pathways 
that are likely to be operative in the organism analyzed. 
When some of the required enzymes cannot be found in 
the genome, the respective pathways are either not opera- 
tive, or use other, unrelated, proteins to catalyze the miss- 
ing steps (see [39]). An example of such an analysis, which 
included superposition of the phylogenetic patterns 
derived from the COGs [37**], over the scheme of glycolysis,
reveals several interesting trends (Figure 1). Glycolysis
includes three reactions that in different species are catalyzed
by non-orthologous enzymes, namely phosphofructokinases,
aldolases and phosphoglycerate mutases.
Interestingly, the second phosphofructokinase in E. coli,
encoded by the pfkB gene, has apparently been recruited
from a ubiquitous family of ribokinase-like sugar kinases.
The ribokinase COG seems to be an example of a complex
family in which the exact orthologous connections are not
always easy to trace. In particular, even though PfkB formally
belongs to the COG, there seems to be no actual
ortholog of it in other genomes. Thus H. pylori does not
encode a phosphofructokinase at all, although it has genes
for other kinases of the ribokinase family and, accordingly,
is represented in the respective COG (Figure 1).

A remarkable case of non-orthologous gene displacement 
involves two unrelated forms of phosphoglycerate mutase,
the 2,3-bisphosphoglycerate (BPG)-dependent and the
BPG-independent one. While H. influenzae and Borrelia
burgdorferi encode only the BPG-dependent form, and H.
pylori, mycoplasmas, and archea encode only the BPG-independent
form (see [40]), free-living bacteria such as E.
coli, Bacillus subtilis and Synechocystis sp. possess genes coding
for both these forms, with two paralogs of the BPG-dependent
one (Figure 1). Phosphofructokinase, aldolase
and fructose bisphosphatase genes are all missing in the
archea (Figure 1), in accordance with the experimental
data [41]. This is consistent with the idea that glycolysis
originally evolved as a biosynthetic pathway, containing 
only the lower (tri-carbon) part [42]. 

Systematic identification of missing links in functional sys- 
tems in organisms for which complete genome sequences 
are available is probably the most important application of 
protein family classification. Conspicuous gaps in the H. 
pylori metabolism became apparent from the COG analysis,
suggesting major revisions to the general scheme of the
central metabolic pathways in this bacterium (Table 2). In 
particular, unlike most other bacteria (and all with com- 
pletely sequenced genomes), H. pylori seems to possess 
neither glycolysis nor the pentose phosphate shunt, the 
Entner-Doudoroff pathway being the only major route of 
sugar catabolism. Indeed, sugar fermentation, resulting in 
intracellular acid production, would be an additional bur- 
den on the pH maintenance mechanism in this bacterium, 
which has to survive in an external pH of 2-3. By contrast, 
gluconeogenesis, which converts organic acids into sugars
required for nucleic acid and peptidoglycan biosynthesis
and thus removes H+ from the cytoplasm, appears to be
fully functional in H. pylori. For the purpose of energy production,
H. pylori apparently depends on amino acid fermentation,
which causes alkalinization of the cytoplasm
and thus relieves part of the problem of pH maintenance. 
Amino acids and oligopeptides that serve as substrates for 
this fermentation are produced by gastric proteolysis and 
transported by readily identifiable permeases. 
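
The gap-finding logic used in this section can be sketched in a few lines. The sketch models only two genomes, with the letter codes of Figure 1 (e, Escherichia coli; u, Helicobacter pylori), and only three reactions; the presence and absence calls follow statements in the text and Table 2, but the family labels are informal names rather than COG identifiers, and the names `PATTERNS`, `STEPS` and `missing_steps` are assumptions made for illustration.

```python
# Toy phylogenetic patterns: one letter per genome in which a family has a
# member (e = Escherichia coli, u = Helicobacter pylori).
PATTERNS = {
    "phosphofructokinase": "e",
    "BPG-dependent phosphoglycerate mutase": "e",
    "BPG-independent phosphoglycerate mutase": "eu",
    "pyruvate kinase": "e",
}

# Each pathway step lists the alternative, possibly unrelated, families that
# can catalyze it (non-orthologous gene displacement).
STEPS = {
    "fructose 6-phosphate phosphorylation": ["phosphofructokinase"],
    "phosphoglycerate mutase reaction": [
        "BPG-dependent phosphoglycerate mutase",
        "BPG-independent phosphoglycerate mutase",
    ],
    "pyruvate formation from phosphoenolpyruvate": ["pyruvate kinase"],
}


def missing_steps(genome, steps=STEPS, patterns=PATTERNS):
    """Return the steps for which no alternative family occurs in the genome."""
    return [
        step
        for step, families in steps.items()
        if not any(genome in patterns.get(family, "") for family in families)
    ]


print(missing_steps("e"))
print(missing_steps("u"))
```

A step counts as covered when any one of its alternative families is present, so the phosphoglycerate mutase reaction is not reported as a gap for H. pylori even though the BPG-dependent form is absent. A reported gap is only a starting point: as noted above, the pathway may be inoperative, or an unrelated protein may catalyze the step.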

From genomes and families to superfamilies 
and folds 

Classification systems aimed at the identification of fam- 
ilies of orthologs make no attempt to capture the more 
subtle conserved motifs in proteins, which reflect 
ancient relationships at the level of superfamilies and 
frequently are critically important for understanding protein
functions and structures [43,44]. Computer methods
for the detection of such motifs and delineation of superfamilies
have lately progressed significantly through programs
such as BLIMPS/MULTIMAT [45], Probe [46],
and PSI-BLAST [47**], which combine pairwise
sequence comparisons with profile analysis. PSI-BLAST,
in particular, has proved to be a powerful tool for the
detection of subtle sequence motifs, resulting in the discovery
of a number of unsuspected superfamily relationships
[47**,48*]. Furthermore, one of the perhaps
under-appreciated benefits of the accumulation of 
genomic sequences is the greatly improved capacity to 
identify even very subtle sequence similarities due to 
Figure 1

Glycolytic enzymes in organisms with completely sequenced genomes. The enzymes are listed under E. coli gene names. The COG numbers are
as in the COG database (www.ncbi.nlm.nih.gov/COG, [37**]) (where available). Shaded arrows indicate reversible reactions, black arrows practically
irreversible ones. Phosphoenolpyruvate synthase-catalyzed reaction in the direction of phosphoenolpyruvate hydrolysis has been demonstrated in
vitro. Phylogenetic patterns are: e, Escherichia coli; h, Haemophilus influenzae; u, Helicobacter pylori; b, Bacillus subtilis; g, Mycoplasma
genitalium; p, Mycoplasma pneumoniae; l, Borrelia burgdorferi; c, Synechocystis sp.; m, Methanococcus jannaschii; t, Methanobacterium
thermoautotrophicum; f, Archaeoglobus fulgidus; y, Saccharomyces cerevisiae; w, Caenorhabditis elegans.



the increasingly uniform population of the protein uni- 
verse by these relatively unbiased sequence sets, of 
which the new methods for sequence analysis mentioned 
above can take advantage [49**].
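
The idea of combining sequence comparison with profile analysis can be illustrated by a deliberately simplified sketch: a log-odds position-specific scoring matrix is built from a few aligned motif instances, used to scan other sequences, and rebuilt after a hit is added. This is not the algorithm of PSI-BLAST or of any other program named above; the uniform background frequencies, the pseudocount, the score threshold, all sequences and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # uniform background, a simplification


def build_profile(aligned, pseudocount=1.0):
    """Build a log-odds scoring matrix from equal-length aligned sequences."""
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window(profile, sequence):
    """Slide the profile along a sequence; return (best score, start index).

    The sequence must be at least as long as the profile.
    """
    width = len(profile)
    scored = [
        (sum(col[aa] for col, aa in zip(profile, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)


def extend_alignment(aligned, database, threshold):
    """One refinement round: add every best window scoring at or above the
    threshold to the alignment, so that the next profile is more general."""
    profile = build_profile(aligned)
    width = len(profile)
    extended = list(aligned)
    for sequence in database.values():
        score, start = best_window(profile, sequence)
        if score >= threshold:
            extended.append(sequence[start:start + width])
    return extended


# Invented motif instances and database sequences.
SEED = ["DVDGTL", "DFDGTL", "DLDGVL"]
DATABASE = {"seqA": "MKKDIDGTLAAR", "seqB": "MSTPLLQERWAG"}

HITS = {
    name: best_window(build_profile(SEED), seq)
    for name, seq in DATABASE.items()
}
print(HITS)
print(extend_alignment(SEED, DATABASE, threshold=5.0))
```

The window found in seqA differs from every seed sequence at the second position, yet it scores well because the profile has learned that this position is variable. Adding it to the alignment and rebuilding the profile is the step that lets iterative methods reach progressively more distant relatives, and a growing, less biased sequence collection gives such profiles more to learn from.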

In the past year, we have seen the identification or significant
extension of a number of protein superfamilies;
some examples, with the distribution among complete
genomes, are shown in Table 3. Most of these superfamilies
are universally found in all genomes, with the counts
more or less proportional to the total number of genes in
the genome. Some expansions are, however, remarkable,
such as, for example, urease-related hydrolases and ATP-grasp
domains in the archea, and HAD superfamily hydrolases
in E. coli and B. subtilis (Table 3). In certain cases, the
phylogenetic distribution of a superfamily immediately
suggests major evolutionary events. Thus the BRCT
domain is present in a single copy in the DNA ligase of all
bacteria (with one additional copy found only in
Synechocystis), is missing in the archea, and is dramatically
expanded in its distribution in the eukaryotes (Table 3).
The most obvious interpretation of this distribution is that
this domain has entered the eukaryotic world by horizontal
gene transfer from bacteria and has undergone extensive
duplication with divergence in the eukaryotes. The
expansion of this domain into a number of eukaryotic proteins
involved in cell-cycle control [50**,51] may have
been critical for the very establishment of these systems.

Table 2

Genes and pathways missing in Helicobacter pylori.

Enzyme activity                       E. coli gene   COG number    Status in H. pylori

Phosphofructokinase                   pfkA           COG0206       Missing
                                      pfkB           COG0525       Present (ribokinase)
Pyruvate kinase                       pykA           COG0470       Missing
                                      pykF
Implications for H. pylori metabolism: Absence of the two key glycolytic enzymes shows that the
Embden-Meyerhof pathway is not functional in H. pylori. Fructose bisphosphatase (HP1385) and
phosphoenolpyruvate synthase (HP0121) are present in H. pylori, allowing it to produce sugars
required for peptidoglycan biosynthesis.

6-Phosphogluconate dehydrogenase      gnd            [illegible]   Missing
Ribose 5-phosphate isomerase          rpiA           [illegible]   Missing
Implications for H. pylori metabolism: Pentose phosphate pathway is also not functional. Even
though H. pylori has a ribose 5-phosphate isomerase encoded by an ortholog of the E. coli rpiB,
no gene coding for 6-phosphogluconate dehydrogenase could be identified. The only saccharolytic
pathway in H. pylori appears to be the Entner-Doudoroff pathway.

Lipoate synthase                      lipA           [illegible]
Lipoate-protein ligase                lplA           [illegible]
                                      lipB           COG0319       Missing
Dihydrolipoamide acetyltransferase    aceF                         Missing
Acetate kinase                        ackA                         Disrupted by frameshifts
Phosphotransacetylase                 pta                          Disrupted by frameshifts
Implications for H. pylori metabolism: Pyruvate dehydrogenase complex is absent in H. pylori;
acetate kinase and phosphotransacetylase are not functional. Pyruvate:ferredoxin oxidoreductase
is the only acetyl-CoA-producing enzyme in H. pylori.

Enzymes of purine biosynthesis        purF           COG0034
                                      purD           [illegible]   Inactivated by mutations
                                      purN                         Missing
                                      purT                         Missing
                                      purL_1         COG0046
                                      purL_2         COG0047       Missing
                                      purM           COG0150       Missing
                                      purK           COG0026       Missing
                                      purE           COG0041       Missing
                                      purC           COG0152       Missing
                                      purH           COG0138       Missing
                                      purA           COG0104       Present
                                      purB           COG0015       Present
                                      guaB           COG0516       Present
                                      guaA_1         COG0518       Present
                                      guaA_2         COG0519       Present
Implications for H. pylori metabolism: De novo purine biosynthesis is absent in H. pylori, and it
has to obtain purines from the host. HP1185 appears to be the best candidate for the purine
permease, as it is the only H. pylori protein similar to E. coli PurP. On the other hand,
H. pylori encodes the enzymes for AMP and GMP synthesis from IMP and their interconversion.
Therefore, it can survive on any of these purines.

With the current acceleration in protein structure determi- 
nation [22,24], a superfamily identified by sequence com- 
parison more and more frequently extends to include 
proteins with known 3D structure and/or well-characterized
catalytic mechanism (Table 3). Such findings are
sometimes most illuminating as they immediately result in 
the prediction of the structural fold, the structure of the 
active center, and possibly also the catalytic mechanism for 
a wide variety of diverse proteins comprising the super- 
family. This is illustrated by the recent prediction of the 



structure and the catalytic amino acid residues for P-ATPases,
which remained elusive in spite of a long history
of studies, on the basis of the sequence motifs shared with
haloacid dehalogenases [52*].

Assignment of the gene products to structural folds and fam- 
ilies with maximal attainable precision is arguably one of the 
foremost tasks of genome analysis after the sequencing 
phase. The number of structures that have been determined 
experimentally is negligible for almost all genomes, with the 
exception of E. coli (where it is still rather a small fraction) 
(Table 1). A database search with a deliberately conservative
similarity cut-off already increases the fraction of proteins for 
which a confident structure prediction is possible to 10-25% 
[53*] (Table 1). Secondary structure-based threading allows 
another relatively small but notable increase in the predictive
power [54*] (Table 1). It appears, however, that at this time,
the most realistic way to further structure prediction at
genome scale is to perform a complete analysis of protein
superfamilies as exemplified in Table 1.
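
As a quick arithmetic check, the share of each proteome with a database-search structure prediction can be recomputed from the "Total" and "Predicted" columns of Table 1. The numbers below are copied from that table (reading the M. genitalium total as 467); C. elegans is omitted because no prediction count is given, and the variable names are chosen for illustration only.

```python
# (total proteins, proteins with a predicted 3D structure), from Table 1.
TABLE_1 = {
    "Escherichia coli": (4289, 667),
    "Haemophilus influenzae": (1717, 267),
    "Helicobacter pylori": (1566, 169),
    "Synechocystis sp.": (3169, 431),
    "Borrelia burgdorferi": (850, 105),
    "Bacillus subtilis": (4100, 578),
    "Mycoplasma genitalium": (467, 75),
    "Mycoplasma pneumoniae": (677, 78),
    "Methanococcus jannaschii": (1715, 170),
    "Methanobacterium thermoautotrophicum": (1869, 199),
    "Archaeoglobus fulgidus": (2407, 290),
    "Saccharomyces cerevisiae": (5932, 846),
}

# Percentage of each proteome with a predicted structure.
FRACTIONS = {
    species: 100.0 * predicted / total
    for species, (total, predicted) in TABLE_1.items()
}

# Second M. genitalium figure in Table 1 (103), which includes threading.
THREADING_M_GENITALIUM = 100.0 * 103 / TABLE_1["Mycoplasma genitalium"][0]

for species, percent in sorted(FRACTIONS.items(), key=lambda item: item[1]):
    print(f"{species:40s}{percent:5.1f}%")
print(f"M. genitalium with threading: {THREADING_M_GENITALIUM:.1f}%")
```

The per-genome values fall between roughly 10% and 16%, and the threading-based figure for M. genitalium is about 22%, which is in line with the 10-25% range quoted above.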

Perspective 

As far as prokaryotic genomes are concerned, we have
already entered the post-genomic era. While surprises
certainly wait ahead, there is little doubt that the major
protein families are already known or can be deciphered
from the available sequences. We have recently seen 
major progress in methods and procedures for advanced 
sequence analysis, and a lot of valuable information has 
been extracted from the genomes. We believe, however, 
that a major focused effort in genome comparison is still 
required in order to construct a proper classification of 
protein families and superfamilies and systematically
apply it to the goals of structural and functional predic- 
tion. Such an effort will have the potential of creating a 
basis for a rationally designed, decisive onslaught on 
structure determination and experimental identification 
of gene functions using computer predictions as a guide. 
Hopefully, this research program will turn out to be both
realistic and efficient. 